Prefer tesserocr over easyocr, if available #369
When setting up our ingestion pipeline, explicitly check if tesserocr is available and Docling can load it. If so, prefer that. Otherwise, attempt the same for EasyOCR. If neither can load, log an error and disable optical character recognition.

Fixes instructlab#352

Signed-off-by: Ben Browning <[email protected]>
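The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper name `pick_ocr_engine` and the `importer` parameter are hypothetical, and the sketch only checks that each module imports, whereas the PR also verifies that Docling can load the engine.

```python
import importlib
import logging

logger = logging.getLogger(__name__)

# Preference order from the description: tesserocr first, then EasyOCR.
OCR_CANDIDATES = ("tesserocr", "easyocr")


def pick_ocr_engine(candidates=OCR_CANDIDATES, importer=importlib.import_module):
    """Return the name of the first importable OCR module, or None.

    Hypothetical helper: it only tests importability and does not
    touch Docling's own OCR configuration.
    """
    for name in candidates:
        try:
            importer(name)
        except Exception as exc:  # ImportError, or a native library failing to load
            logger.debug("OCR engine %s is unavailable: %s", name, exc)
            continue
        return name
    # Neither engine loaded: report it and let the caller disable OCR.
    logger.error("No OCR engine could be loaded; disabling OCR")
    return None
```

A caller would treat a `None` result as "run the pipeline with OCR turned off" rather than as a fatal error.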
faec215 to ba00454
Thanks for the approval! After following some discussion elsewhere about being careful when we import anything that imports all of
This borrows and adapts the `leanimports.py` script and test from the InstructLab CLI repository to ensure within SDG we're not prematurely loading the entirety of Torch into memory. The CLI repo noticed we were doing this, and since this PR would actually have exacerbated this by attempting to load the tesseract and easyocr modules even earlier, this felt like the right time to address this.

The overall imports are all the same, but now we only import specific docling pieces as needed when we're actually going to run chunking vs triggering the whole PyTorch import chain as soon as someone imports SDG.

Signed-off-by: Ben Browning <[email protected]>
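For readers unfamiliar with this kind of check, here is a rough sketch of the idea. It is not the actual `leanimports.py` script, and the function name `import_pulls_in` is hypothetical. It imports a package in a fresh interpreter and reports whether an unwanted module ended up in `sys.modules`.

```python
import subprocess
import sys


def import_pulls_in(package, unwanted):
    """Return True if importing `package` in a fresh interpreter loads `unwanted`.

    A subprocess is used so that modules already imported by the
    current process do not affect the result.
    """
    code = (
        "import importlib, sys\n"
        f"importlib.import_module({package!r})\n"
        f"print({unwanted!r} in sys.modules)\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,  # raise if the import itself fails
    )
    # Use the last line in case the package prints something on import.
    return result.stdout.strip().splitlines()[-1] == "True"
```

A test along these lines could then assert that `import_pulls_in("instructlab.sdg", "torch")` is false, assuming `instructlab.sdg` is the importable package name for SDG. The heavy imports themselves move inside the functions that perform chunking, so they run only when chunking is requested.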
Ok, removing the hold now that we're not importing all of PyTorch as soon as someone imports SDG. Instead, we defer that until Docling actually needs
Thanks for taking care of this, Ben! 😁
@Mergifyio backport release-v0.5
✅ Backports have been created
Prefer tesserocr over easyocr, if available (backport #369)